Aim:
The aim of this project is to apply classification techniques to predict
novel transcription factor target genes from multi-omics data, focusing
on Sox2 and Nanog during embryonic stem cell (ESC) differentiation.
Background:
Transcriptional regulation is a fundamental process in all living
organisms, driven by transcription factors that control mRNA expression.
These transcriptional networks are crucial in development, lineage
specification, and cell fate decisions during early embryonic
development (Theunissen
and Jaenisch, 2017). Recent advances in omics technologies allow for
the profiling of genome-wide transcriptional and epigenetic events,
providing a deeper understanding of these networks.
In this project, we utilize high-temporal-resolution multi-omics data of ESC differentiation (Yang et al., 2019) to predict novel targets of Sox2 and Nanog, two key transcription factors involved in maintaining pluripotency and guiding cell differentiation.
Dataset Overview:
- Transcriptome: Time-course mRNA profiles during ESC differentiation.
- Proteome: Time-course protein expression profiles during ESC differentiation.
- Epigenome: Time-course profiles of five histone marks and PolII occupancy during ESC differentiation.
We will develop and validate a classification model to predict novel transcription factor target genes, focusing on Sox2 and Nanog, using the provided multi-omics datasets.
We start by loading the necessary R packages and the dataset
Final_Project_ESC.RData, which contains the transcriptome,
proteome, and epigenome data, along with a subset of known Sox2/Nanog
target genes.
# Load necessary packages and data
load("Final_Project_ESC.RData", verbose = TRUE)
## Loading objects:
## Transcriptome
## Proteome
## H3K4me3
## H3K27me3
## PolII
## H3K4me1
## H3K27ac
## H3K9me2
## cMyc_target_genes
## cMyc_target_genes_subset
## OSN_target_genes
## OSN_target_genes_subset
suppressPackageStartupMessages({
library(e1071)
library(ggplot2)
library(ROCR)
library(calibrate)
library(dplyr)
library(tibble)
library(reshape2)
library(kernlab)
library(caret)
library(randomForest)
library(adabag)
library(gbm)
library(xgboost)
library(pROC)
library(doParallel)
library(calibrate)
})
Describe and Explore the Dataset
Before beginning data analysis it is important to understand and investigate the data. The goal of this report is to predict novel transcription factor target genes from multi-omics data. For each of our datasets, let's look at the structure of the data and perform PCA analysis as a means of identifying trends in the dataset.
Below is the temporal expression of the Transcriptome, Proteome, and Epigenome datasets at timepoints 0, 1, 6, 12, 24, 36, 48, and 72 hours.
head(Transcriptome)
## 0hr 1hr 6hr 12hr 24hr
## GNAI3 8.881784e-16 0.29131114 0.618947976 0.48337689 0.71864514
## CDC45 0.000000e+00 0.25062323 0.199000752 0.38800303 0.47163966
## H19 0.000000e+00 0.38475910 0.471216601 0.55341685 1.87279375
## SCML2 8.881784e-16 0.64099671 0.917100751 0.73622935 0.74708761
## NARF -8.881784e-16 -0.40736349 -1.186942510 -0.69986049 -0.15754727
## CAV2 5.551115e-17 0.01608049 0.002154149 0.07278802 0.06737809
## 36hr 48hr 72hr
## GNAI3 0.90833508 0.63408090 0.81560693
## CDC45 0.20338260 -0.03095025 -0.55159826
## H19 2.77535428 4.17558703 4.34079023
## SCML2 0.71202112 -0.20856228 -1.12588475
## NARF -0.73012533 -0.11611985 -0.01534405
## CAV2 0.08131677 0.46840687 1.32719756
dim(Transcriptome)
## [1] 19788 8
colnames(Transcriptome)
## [1] "0hr" "1hr" "6hr" "12hr" "24hr" "36hr" "48hr" "72hr"
# PCA analysis on the correlation matrix of the transcriptome data
cor.mat <- cor(Transcriptome)
pca.mat <- prcomp(cor.mat)
# Plot the PCA
grp <- rownames(pca.mat$x)
grp.col <- rainbow(nrow(pca.mat$x))
names(grp.col) <- rownames(pca.mat$x)
# Generate PCA plot
plot(pca.mat$x[,1], pca.mat$x[,2], col=grp.col[grp], pch=19, cex=2,
xlab=paste0("PC1 (", round(summary(pca.mat)$importance[2,1]*100,1), "% variance)"),
ylab=paste0("PC2 (", round(summary(pca.mat)$importance[2,2]*100,1), "% variance)"))
# Add sample labels to the plot
calibrate::textxy(pca.mat$x[,1], pca.mat$x[,2], labs=grp, cex=0.5)
cor.proteome <- cor(Proteome)
pca.proteome <- prcomp(cor.proteome)
summary(pca.proteome)$importance
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 0.5233315 0.2915203 0.1624012 0.1053089 0.07579296
## Proportion of Variance 0.6763300 0.2098700 0.0651300 0.0273900 0.01419000
## Cumulative Proportion 0.6763300 0.8862000 0.9513300 0.9787100 0.99290000
## PC6 PC7
## Standard deviation 0.05362587 2.715731e-18
## Proportion of Variance 0.00710000 0.000000e+00
## Cumulative Proportion 1.00000000 1.000000e+00
# Using the previous correlation matrix and PCA results
cor.proteome <- cor(Proteome)
pca.proteome <- prcomp(cor.proteome)
# Get group labels and colors
grp <- rownames(pca.proteome$x) # Set groups according to your data
grp.col <- rainbow(nrow(pca.proteome$x))
names(grp.col) <- rownames(pca.proteome$x)
# Plot the PCA
plot(pca.proteome$x[,1], pca.proteome$x[,2], col=grp.col[grp], pch=19, cex=2,
xlab=paste0("PC1 (", round(summary(pca.proteome)$importance[2,1]*100,1), "% variance)"),
ylab=paste0("PC2 (", round(summary(pca.proteome)$importance[2,2]*100,1), "% variance)"))
# PCA analysis on the correlation matrix of the H3K4me3 data
cor.h3k4me3 <- cor(H3K4me3)
pca.h3k4me3 <- prcomp(cor.h3k4me3)
# Get group labels and colors
grp <- rownames(pca.h3k4me3$x)
grp.col <- rainbow(nrow(pca.h3k4me3$x))
names(grp.col) <- rownames(pca.h3k4me3$x)
# Generate PCA plot for H3K4me3
plot(pca.h3k4me3$x[,1], pca.h3k4me3$x[,2], col=grp.col[grp], pch=19, cex=2,
xlab=paste0("PC1 (", round(summary(pca.h3k4me3)$importance[2,1]*100,1), "% variance)"),
ylab=paste0("PC2 (", round(summary(pca.h3k4me3)$importance[2,2]*100,1), "% variance)"))
# Add sample labels to the plot
calibrate::textxy(pca.h3k4me3$x[,1], pca.h3k4me3$x[,2], labs=grp, cex=0.5)
# PCA analysis for H3K27me3 data
cor.h3k27me3 <- cor(H3K27me3)
pca.h3k27me3 <- prcomp(cor.h3k27me3)
# Update group labels and colors for H3K27me3
grp.h3k27me3 <- rownames(pca.h3k27me3$x)
grp.col.h3k27me3 <- rainbow(nrow(pca.h3k27me3$x))
names(grp.col.h3k27me3) <- rownames(pca.h3k27me3$x)
# Generate PCA plot for H3K27me3
plot(pca.h3k27me3$x[,1], pca.h3k27me3$x[,2], col=grp.col.h3k27me3[grp.h3k27me3], pch=19, cex=2,
xlab=paste0("PC1 (", round(summary(pca.h3k27me3)$importance[2,1]*100,1), "% variance)"),
ylab=paste0("PC2 (", round(summary(pca.h3k27me3)$importance[2,2]*100,1), "% variance)"))
# Correctly label the samples for H3K27me3
calibrate::textxy(pca.h3k27me3$x[,1], pca.h3k27me3$x[,2], labs=grp.h3k27me3, cex=0.5)
# PCA analysis for H3K27ac data
cor.h3k27ac <- cor(H3K27ac)
pca.h3k27ac <- prcomp(cor.h3k27ac)
# Update group labels and colors for H3K27ac
grp.h3k27ac <- rownames(pca.h3k27ac$x)
grp.col.h3k27ac <- rainbow(nrow(pca.h3k27ac$x))
names(grp.col.h3k27ac) <- rownames(pca.h3k27ac$x)
# Generate PCA plot for H3K27ac
plot(pca.h3k27ac$x[,1], pca.h3k27ac$x[,2], col=grp.col.h3k27ac[grp.h3k27ac], pch=19, cex=2,
xlab=paste0("PC1 (", round(summary(pca.h3k27ac)$importance[2,1]*100,1), "% variance)"),
ylab=paste0("PC2 (", round(summary(pca.h3k27ac)$importance[2,2]*100,1), "% variance)"))
# Correctly label the samples for H3K27ac
calibrate::textxy(pca.h3k27ac$x[,1], pca.h3k27ac$x[,2], labs=grp.h3k27ac, cex=0.5)
# PCA analysis for H3K4me1 data
cor.h3k4me1 <- cor(H3K4me1)
pca.h3k4me1 <- prcomp(cor.h3k4me1)
# Define the group labels and colors specifically for H3K4me1 data
grp.h3k4me1 <- rownames(pca.h3k4me1$x)
grp.col.h3k4me1 <- rainbow(nrow(pca.h3k4me1$x))
names(grp.col.h3k4me1) <- rownames(pca.h3k4me1$x)
# Generate PCA plot for H3K4me1
plot(pca.h3k4me1$x[,1], pca.h3k4me1$x[,2], col=grp.col.h3k4me1[grp.h3k4me1], pch=19, cex=2,
xlab=paste0("PC1 (", round(summary(pca.h3k4me1)$importance[2,1]*100,1), "% variance)"),
ylab=paste0("PC2 (", round(summary(pca.h3k4me1)$importance[2,2]*100,1), "% variance)"))
# Correctly label the samples for H3K4me1
calibrate::textxy(pca.h3k4me1$x[,1], pca.h3k4me1$x[,2], labs=grp.h3k4me1, cex=0.5)
# PCA analysis for H3K9me2 data
cor.h3k9me2 <- cor(H3K9me2)
pca.h3k9me2 <- prcomp(cor.h3k9me2)
# Define the group labels and colors specifically for H3K9me2 data
grp.h3k9me2 <- rownames(pca.h3k9me2$x)
grp.col.h3k9me2 <- rainbow(nrow(pca.h3k9me2$x))
names(grp.col.h3k9me2) <- rownames(pca.h3k9me2$x)
# Generate PCA plot for H3K9me2
plot(pca.h3k9me2$x[,1], pca.h3k9me2$x[,2], col=grp.col.h3k9me2[grp.h3k9me2], pch=19, cex=2,
xlab=paste0("PC1 (", round(summary(pca.h3k9me2)$importance[2,1]*100,1), "% variance)"),
ylab=paste0("PC2 (", round(summary(pca.h3k9me2)$importance[2,2]*100,1), "% variance)"))
# Correctly label the samples for H3K9me2
calibrate::textxy(pca.h3k9me2$x[,1], pca.h3k9me2$x[,2], labs=grp.h3k9me2, cex=0.5)
# PCA analysis for PolII data
cor.polii <- cor(PolII)
pca.polii <- prcomp(cor.polii)
# Define the group labels and colors specifically for PolII data
grp.polii <- rownames(pca.polii$x)
grp.col.polii <- rainbow(nrow(pca.polii$x))
names(grp.col.polii) <- rownames(pca.polii$x)
# Generate PCA plot for PolII
plot(pca.polii$x[,1], pca.polii$x[,2], col=grp.col.polii[grp.polii], pch=19, cex=2,
xlab=paste0("PC1 (", round(summary(pca.polii)$importance[2,1]*100,1), "% variance)"),
ylab=paste0("PC2 (", round(summary(pca.polii)$importance[2,2]*100,1), "% variance)"))
# Correctly label the samples for PolII
calibrate::textxy(pca.polii$x[,1], pca.polii$x[,2], labs=grp.polii, cex=0.5)
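The per-assay PCA code above repeats the same steps eight times. As an optional refactor (a sketch, not part of the original analysis; the function name `plot_sample_pca` is our own), it can be wrapped in a small helper:

```r
# Sketch of a reusable helper for the sample-level PCA plots above.
# Assumes each input is a genes-x-timepoints numeric matrix.
plot_sample_pca <- function(mat, title = "") {
  pca <- prcomp(cor(mat))
  grp <- rownames(pca$x)
  grp.col <- setNames(rainbow(length(grp)), grp)
  pc.var <- summary(pca)$importance[2, 1:2] * 100  # % variance of PC1/PC2
  plot(pca$x[, 1], pca$x[, 2], col = grp.col[grp], pch = 19, cex = 2,
       main = title,
       xlab = paste0("PC1 (", round(pc.var[1], 1), "% variance)"),
       ylab = paste0("PC2 (", round(pc.var[2], 1), "% variance)"))
  # Label samples if the calibrate package is available
  if (requireNamespace("calibrate", quietly = TRUE)) {
    calibrate::textxy(pca$x[, 1], pca$x[, 2], labs = grp, cex = 0.5)
  }
  invisible(pca)
}

# Usage, e.g.: plot_sample_pca(H3K4me3, "H3K4me3")
```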
In order to properly model and predict novel transcription factor target genes, we join the eight datasets together and perform our calculations on the combined dataset.
To ensure consistency across datasets, we filter each dataset to include only the common genes present in all omics layers. We then combine these filtered datasets into a single data frame for further analysis.
# Ensure all data has the same set of genes
genes <- intersect(rownames(Transcriptome), rownames(Proteome))
genes <- intersect(genes, rownames(H3K4me3))
genes <- intersect(genes, rownames(H3K27me3))
genes <- intersect(genes, rownames(H3K27ac))
genes <- intersect(genes, rownames(H3K4me1))
genes <- intersect(genes, rownames(H3K9me2))
genes <- intersect(genes, rownames(PolII))
# Filter each dataset for the common genes
Transcriptome_filter <- Transcriptome[genes, ]
Proteome_filter <- Proteome[genes, ]
H3K4me3_filter <- H3K4me3[genes, ]
H3K27me3_filter <- H3K27me3[genes, ]
H3K27ac_filter <- H3K27ac[genes, ]
H3K4me1_filter <- H3K4me1[genes, ]
H3K9me2_filter <- H3K9me2[genes, ]
PolII_filter <- PolII[genes, ]
# Confirm that all datasets share the same genes
identical(rownames(Transcriptome_filter), rownames(H3K4me3_filter))
## [1] TRUE
identical(rownames(Proteome_filter), rownames(H3K4me3_filter))
## [1] TRUE
identical(rownames(H3K27ac_filter), rownames(H3K4me3_filter))
## [1] TRUE
identical(rownames(H3K4me1_filter), rownames(H3K4me3_filter))
## [1] TRUE
identical(rownames(H3K9me2_filter), rownames(H3K4me3_filter))
## [1] TRUE
identical(rownames(PolII_filter), rownames(H3K4me3_filter))
## [1] TRUE
# Rename columns to avoid conflicts
colnames(Transcriptome_filter) <- paste("T_", colnames(Transcriptome_filter), sep = "")
colnames(Proteome_filter) <- paste("P_", colnames(Proteome_filter), sep = "")
colnames(H3K4me3_filter) <- paste("H3K4me3_", colnames(H3K4me3_filter), sep = "")
colnames(H3K27me3_filter) <- paste("H3K27me3_", colnames(H3K27me3_filter), sep = "")
colnames(H3K27ac_filter) <- paste("H3K27ac_", colnames(H3K27ac_filter), sep = "")
colnames(H3K4me1_filter) <- paste("H3K4me1_", colnames(H3K4me1_filter), sep = "")
colnames(H3K9me2_filter) <- paste("H3K9me2_", colnames(H3K9me2_filter), sep = "")
colnames(PolII_filter) <- paste("PolII_", colnames(PolII_filter), sep = "")
# Combine the datasets
combined_data <- cbind(
Transcriptome_filter,
Proteome_filter,
H3K4me3_filter,
H3K27me3_filter,
H3K27ac_filter,
H3K4me1_filter,
H3K9me2_filter,
PolII_filter
)
# Add the labels
label <- ifelse(genes %in% OSN_target_genes_subset, "OSN", "Other")
combined_data <- data.frame(combined_data)
combined_data$label <- factor(label)
# Number of genes which are known to be targets for Sox2 and Nanog
length(OSN_target_genes_subset)
## [1] 100
We have 100 known target genes for OSN, and as seen below, 95 of these known Sox2/Nanog targets are present in our combined filtered dataset.
# Check the initial label distribution
print(table(combined_data$label))
##
## OSN Other
## 95 8085
The dataset is split into training (90%) and testing (10%) sets. The label column is reassigned to the test set to ensure that it is included correctly.
# Split the dataset into training (90%) and testing (10%) sets
set.seed(123)
train_index <- createDataPartition(combined_data$label, p = 0.9, list = FALSE)
train_data <- combined_data[train_index, ]
test_data <- combined_data[-train_index, ]
# Reassign the label column to test_data
test_data$label <- combined_data[-train_index, "label"]
# Check the distribution of labels in the training and test sets
print("Training set label distribution:")
## [1] "Training set label distribution:"
print(table(train_data$label))
##
## OSN Other
## 86 7277
print("Test set label distribution:")
## [1] "Test set label distribution:"
print(table(test_data$label))
##
## OSN Other
## 9 808
Since the dataset is imbalanced, downsampling is used to create a balanced training dataset. This ensures that both classes (OSN and Other) are equally represented, improving the robustness of the classification model.
# Balance the training data using downsampling
set.seed(123)
downsampled_train_data <- downSample(x = train_data[, -ncol(train_data)],
y = train_data$label,
yname = "label")
# Check the balanced label distribution
print("Balanced training set label distribution:")
## [1] "Balanced training set label distribution:"
print(table(downsampled_train_data$label))
##
## OSN Other
## 86 86
# Final check of training dataset dimensions
print(dim(downsampled_train_data))
## [1] 172 64
Train SVM and Random Forest Models
We train two machine learning models, an SVM with the radial basis function kernel and a Random Forest, using the balanced training dataset.
# Set a consistent seed
set.seed(123)
# Train an SVM model on the downsampled training data with radial basis function kernel
svm_model <- svm(label ~ ., data = downsampled_train_data, kernel = "radial", probability = TRUE)
# Train a Random Forest model on the downsampled training data
set.seed(123)
rf_model <- randomForest(label ~ ., data = downsampled_train_data, ntree = 1000)
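As an optional diagnostic (not part of the original pipeline; the helper name `top_rf_features` is our own), the Random Forest's variable importance can show which omics features, for example particular histone marks or time points, drive the classification:

```r
# Helper (sketch): return the n most important features of a fitted
# randomForest classifier, ranked by mean decrease in Gini impurity.
top_rf_features <- function(model, n = 10) {
  imp <- randomForest::importance(model)
  imp <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
  head(imp, n)
}

# Usage on the model above, e.g.:
# top_rf_features(rf_model, 10)
# randomForest::varImpPlot(rf_model, n.var = 10)
```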
For comparison, we also train a bagged classification tree model with 5-fold cross-validation.
bagged_trees <- train(
label ~ .,
data = downsampled_train_data,
method = "treebag",
trControl = trainControl(method = "cv", number = 5),
tuneLength = 5
)
print(bagged_trees)
## Bagged CART
##
## 172 samples
## 63 predictor
## 2 classes: 'OSN', 'Other'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 137, 138, 138, 137, 138
## Resampling results:
##
## Accuracy Kappa
## 0.7447059 0.4888148
For hyperparameter tuning, we define the grid of parameters below and train an XGBoost model on the downsampled data. We use cross-validation to obtain a reliable model that is not overfit, and register a parallel backend so the model can be trained in parallel.
tune_grid_xgb <- expand.grid(
nrounds = c(100, 200),
max_depth = c(3, 6),
eta = c(0.1, 0.3),
gamma = c(0, 0.1),
colsample_bytree = c(0.5, 1),
min_child_weight = c(1, 10),
subsample = c(0.5, 1)
)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
train_control <- trainControl(
method = "cv",
number = 3,
savePredictions = "final",
verboseIter = TRUE,
allowParallel = TRUE
)
xgb_model_tuned <- train(
label ~ .,
data = downsampled_train_data,
method = "xgbTree",
trControl = train_control,
tuneGrid = tune_grid_xgb,
metric = "Accuracy" # performance metric used to select the best tune
)
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 100, max_depth = 3, eta = 0.1, gamma = 0.1, colsample_bytree = 1, min_child_weight = 1, subsample = 0.5 on full training set
print(xgb_model_tuned)
## eXtreme Gradient Boosting
##
## 172 samples
## 63 predictor
## 2 classes: 'OSN', 'Other'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 114, 116, 114
## Resampling results across tuning parameters:
##
## eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
## 0.1 3 0.0 0.5 1 0.5 100
## 0.1 3 0.0 0.5 1 0.5 200
## 0.1 3 0.0 0.5 1 1.0 100
## 0.1 3 0.0 0.5 1 1.0 200
## 0.1 3 0.0 0.5 10 0.5 100
## 0.1 3 0.0 0.5 10 0.5 200
## 0.1 3 0.0 0.5 10 1.0 100
## 0.1 3 0.0 0.5 10 1.0 200
## 0.1 3 0.0 1.0 1 0.5 100
## 0.1 3 0.0 1.0 1 0.5 200
## 0.1 3 0.0 1.0 1 1.0 100
## 0.1 3 0.0 1.0 1 1.0 200
## 0.1 3 0.0 1.0 10 0.5 100
## 0.1 3 0.0 1.0 10 0.5 200
## 0.1 3 0.0 1.0 10 1.0 100
## 0.1 3 0.0 1.0 10 1.0 200
## 0.1 3 0.1 0.5 1 0.5 100
## 0.1 3 0.1 0.5 1 0.5 200
## 0.1 3 0.1 0.5 1 1.0 100
## 0.1 3 0.1 0.5 1 1.0 200
## 0.1 3 0.1 0.5 10 0.5 100
## 0.1 3 0.1 0.5 10 0.5 200
## 0.1 3 0.1 0.5 10 1.0 100
## 0.1 3 0.1 0.5 10 1.0 200
## 0.1 3 0.1 1.0 1 0.5 100
## 0.1 3 0.1 1.0 1 0.5 200
## 0.1 3 0.1 1.0 1 1.0 100
## 0.1 3 0.1 1.0 1 1.0 200
## 0.1 3 0.1 1.0 10 0.5 100
## 0.1 3 0.1 1.0 10 0.5 200
## 0.1 3 0.1 1.0 10 1.0 100
## 0.1 3 0.1 1.0 10 1.0 200
## 0.1 6 0.0 0.5 1 0.5 100
## 0.1 6 0.0 0.5 1 0.5 200
## 0.1 6 0.0 0.5 1 1.0 100
## 0.1 6 0.0 0.5 1 1.0 200
## 0.1 6 0.0 0.5 10 0.5 100
## 0.1 6 0.0 0.5 10 0.5 200
## 0.1 6 0.0 0.5 10 1.0 100
## 0.1 6 0.0 0.5 10 1.0 200
## 0.1 6 0.0 1.0 1 0.5 100
## 0.1 6 0.0 1.0 1 0.5 200
## 0.1 6 0.0 1.0 1 1.0 100
## 0.1 6 0.0 1.0 1 1.0 200
## 0.1 6 0.0 1.0 10 0.5 100
## 0.1 6 0.0 1.0 10 0.5 200
## 0.1 6 0.0 1.0 10 1.0 100
## 0.1 6 0.0 1.0 10 1.0 200
## 0.1 6 0.1 0.5 1 0.5 100
## 0.1 6 0.1 0.5 1 0.5 200
## 0.1 6 0.1 0.5 1 1.0 100
## 0.1 6 0.1 0.5 1 1.0 200
## 0.1 6 0.1 0.5 10 0.5 100
## 0.1 6 0.1 0.5 10 0.5 200
## 0.1 6 0.1 0.5 10 1.0 100
## 0.1 6 0.1 0.5 10 1.0 200
## 0.1 6 0.1 1.0 1 0.5 100
## 0.1 6 0.1 1.0 1 0.5 200
## 0.1 6 0.1 1.0 1 1.0 100
## 0.1 6 0.1 1.0 1 1.0 200
## 0.1 6 0.1 1.0 10 0.5 100
## 0.1 6 0.1 1.0 10 0.5 200
## 0.1 6 0.1 1.0 10 1.0 100
## 0.1 6 0.1 1.0 10 1.0 200
## 0.3 3 0.0 0.5 1 0.5 100
## 0.3 3 0.0 0.5 1 0.5 200
## 0.3 3 0.0 0.5 1 1.0 100
## 0.3 3 0.0 0.5 1 1.0 200
## 0.3 3 0.0 0.5 10 0.5 100
## 0.3 3 0.0 0.5 10 0.5 200
## 0.3 3 0.0 0.5 10 1.0 100
## 0.3 3 0.0 0.5 10 1.0 200
## 0.3 3 0.0 1.0 1 0.5 100
## 0.3 3 0.0 1.0 1 0.5 200
## 0.3 3 0.0 1.0 1 1.0 100
## 0.3 3 0.0 1.0 1 1.0 200
## 0.3 3 0.0 1.0 10 0.5 100
## 0.3 3 0.0 1.0 10 0.5 200
## 0.3 3 0.0 1.0 10 1.0 100
## 0.3 3 0.0 1.0 10 1.0 200
## 0.3 3 0.1 0.5 1 0.5 100
## 0.3 3 0.1 0.5 1 0.5 200
## 0.3 3 0.1 0.5 1 1.0 100
## 0.3 3 0.1 0.5 1 1.0 200
## 0.3 3 0.1 0.5 10 0.5 100
## 0.3 3 0.1 0.5 10 0.5 200
## 0.3 3 0.1 0.5 10 1.0 100
## 0.3 3 0.1 0.5 10 1.0 200
## 0.3 3 0.1 1.0 1 0.5 100
## 0.3 3 0.1 1.0 1 0.5 200
## 0.3 3 0.1 1.0 1 1.0 100
## 0.3 3 0.1 1.0 1 1.0 200
## 0.3 3 0.1 1.0 10 0.5 100
## 0.3 3 0.1 1.0 10 0.5 200
## 0.3 3 0.1 1.0 10 1.0 100
## 0.3 3 0.1 1.0 10 1.0 200
## 0.3 6 0.0 0.5 1 0.5 100
## 0.3 6 0.0 0.5 1 0.5 200
## 0.3 6 0.0 0.5 1 1.0 100
## 0.3 6 0.0 0.5 1 1.0 200
## 0.3 6 0.0 0.5 10 0.5 100
## 0.3 6 0.0 0.5 10 0.5 200
## 0.3 6 0.0 0.5 10 1.0 100
## 0.3 6 0.0 0.5 10 1.0 200
## 0.3 6 0.0 1.0 1 0.5 100
## 0.3 6 0.0 1.0 1 0.5 200
## 0.3 6 0.0 1.0 1 1.0 100
## 0.3 6 0.0 1.0 1 1.0 200
## 0.3 6 0.0 1.0 10 0.5 100
## 0.3 6 0.0 1.0 10 0.5 200
## 0.3 6 0.0 1.0 10 1.0 100
## 0.3 6 0.0 1.0 10 1.0 200
## 0.3 6 0.1 0.5 1 0.5 100
## 0.3 6 0.1 0.5 1 0.5 200
## 0.3 6 0.1 0.5 1 1.0 100
## 0.3 6 0.1 0.5 1 1.0 200
## 0.3 6 0.1 0.5 10 0.5 100
## 0.3 6 0.1 0.5 10 0.5 200
## 0.3 6 0.1 0.5 10 1.0 100
## 0.3 6 0.1 0.5 10 1.0 200
## 0.3 6 0.1 1.0 1 0.5 100
## 0.3 6 0.1 1.0 1 0.5 200
## 0.3 6 0.1 1.0 1 1.0 100
## 0.3 6 0.1 1.0 1 1.0 200
## 0.3 6 0.1 1.0 10 0.5 100
## 0.3 6 0.1 1.0 10 0.5 200
## 0.3 6 0.1 1.0 10 1.0 100
## 0.3 6 0.1 1.0 10 1.0 200
## Accuracy Kappa
## 0.7497947 0.4995895
## 0.7323481 0.4646962
## 0.7619048 0.5238095
## 0.7502053 0.5004105
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6687192 0.3374384
## 0.6687192 0.3374384
## 0.7561576 0.5123153
## 0.7210591 0.4421182
## 0.7504105 0.5008210
## 0.7329639 0.4659278
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6744663 0.3489327
## 0.6802135 0.3604269
## 0.7151067 0.4302135
## 0.7261905 0.4523810
## 0.7272167 0.4544335
## 0.7327586 0.4655172
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6744663 0.3489327
## 0.6744663 0.3489327
## 0.7676519 0.5353038
## 0.7559524 0.5119048
## 0.7448686 0.4897373
## 0.7563629 0.5127258
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6744663 0.3489327
## 0.6802135 0.3604269
## 0.7440476 0.4880952
## 0.7500000 0.5000000
## 0.7504105 0.5008210
## 0.7561576 0.5123153
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6629721 0.3259442
## 0.6629721 0.3259442
## 0.7329639 0.4659278
## 0.7327586 0.4655172
## 0.7331691 0.4663383
## 0.7272167 0.4544335
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6744663 0.3489327
## 0.6802135 0.3604269
## 0.7674466 0.5348933
## 0.7557471 0.5114943
## 0.7621100 0.5242200
## 0.7621100 0.5242200
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6863711 0.3727422
## 0.6863711 0.3727422
## 0.7385057 0.4770115
## 0.7502053 0.5004105
## 0.7274220 0.4548440
## 0.7333744 0.4667488
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6744663 0.3489327
## 0.6802135 0.3604269
## 0.7212644 0.4425287
## 0.7157225 0.4314450
## 0.7091544 0.4183087
## 0.7032020 0.4064039
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6629721 0.3259442
## 0.6629721 0.3259442
## 0.7385057 0.4770115
## 0.7323481 0.4646962
## 0.7448686 0.4897373
## 0.7274220 0.4548440
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6800082 0.3600164
## 0.6800082 0.3600164
## 0.7155172 0.4310345
## 0.7270115 0.4540230
## 0.7387110 0.4774220
## 0.7387110 0.4774220
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.7036125 0.4072250
## 0.7036125 0.4072250
## 0.7034072 0.4068144
## 0.7153120 0.4306240
## 0.7448686 0.4897373
## 0.7448686 0.4897373
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6800082 0.3600164
## 0.6800082 0.3600164
## 0.7376847 0.4753695
## 0.7378900 0.4757800
## 0.7446634 0.4893268
## 0.7389163 0.4778325
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6921182 0.3842365
## 0.6921182 0.3842365
## 0.7446634 0.4893268
## 0.7329639 0.4659278
## 0.7097701 0.4195402
## 0.7153120 0.4306240
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6800082 0.3600164
## 0.6800082 0.3600164
## 0.7325534 0.4651067
## 0.7385057 0.4770115
## 0.7619048 0.5238095
## 0.7619048 0.5238095
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6683087 0.3366174
## 0.6683087 0.3366174
## 0.7442529 0.4885057
## 0.7383005 0.4766010
## 0.7040230 0.4080460
## 0.7040230 0.4080460
## 0.5000000 0.0000000
## 0.5000000 0.0000000
## 0.6800082 0.3600164
## 0.6800082 0.3600164
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 3, eta
## = 0.1, gamma = 0.1, colsample_bytree = 1, min_child_weight = 1 and subsample
## = 0.5.
# Printing the best model's details
print(xgb_model_tuned$bestTune)
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 25 100 3 0.1 0.1 1 1 0.5
# Plotting model performance
plot(xgb_model_tuned)
After hyperparameter tuning, we proceed with the best configuration selected by cross-validated accuracy (nrounds = 100, max_depth = 3, eta = 0.1, gamma = 0.1, colsample_bytree = 1, min_child_weight = 1, subsample = 0.5).
Predictions and Confusion Matrix
We use the trained models to make predictions on the test set and evaluate their performance using confusion matrices.
# Ensure column names in the test set match the training data
colnames(test_data)[1:(ncol(test_data) - 1)] <- colnames(downsampled_train_data)[1:(ncol(downsampled_train_data) - 1)] # Exclude the label column
# Check if the label column is present and correctly populated
if ("label" %in% colnames(combined_data) && length(combined_data[-train_index, "label"]) == nrow(test_data)) {
# Assign the label to test_data
test_data$label <- combined_data[-train_index, "label"]
} else {
stop("The label vector is empty or has a different length than expected. Check the data preparation steps.")
}
# Ensure the label is a factor with the correct levels
test_data$label <- factor(test_data$label, levels = c("OSN", "Other"))
# SVM Predictions on the test set
svm_test_pred <- predict(svm_model, newdata = test_data[, -ncol(test_data)], probability = TRUE)
svm_test_prob <- attr(svm_test_pred, "probabilities")[, "OSN"]
# Random Forest Predictions on the test set
rf_test_pred <- predict(rf_model, newdata = test_data[, -ncol(test_data)], type = "prob")[, "OSN"]
# SVM Confusion Matrix on the test set
svm_test_conf_matrix <- table(Predicted = ifelse(svm_test_prob > 0.5, "OSN", "Other"), Actual = test_data$label)
print("SVM Test Confusion Matrix:")
## [1] "SVM Test Confusion Matrix:"
print(svm_test_conf_matrix)
## Actual
## Predicted OSN Other
## OSN 7 194
## Other 2 614
# Random Forest Confusion Matrix on the test set
rf_test_conf_matrix <- table(Predicted = ifelse(rf_test_pred > 0.5, "OSN", "Other"), Actual = test_data$label)
print("Random Forest Test Confusion Matrix:")
## [1] "Random Forest Test Confusion Matrix:"
print(rf_test_conf_matrix)
## Actual
## Predicted OSN Other
## OSN 8 204
## Other 1 604
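Sensitivity and specificity can be read directly off each confusion matrix; a small base-R helper (added here for illustration, treating "OSN" as the positive class):

```r
# Summarise a 2x2 confusion matrix (rows = Predicted, cols = Actual)
# with "OSN" as the positive class.
class_metrics <- function(conf) {
  tp <- conf["OSN", "OSN"];   fn <- conf["Other", "OSN"]
  fp <- conf["OSN", "Other"]; tn <- conf["Other", "Other"]
  c(sensitivity = tp / (tp + fn),  # recall on true OSN targets
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}

# e.g. round(class_metrics(svm_test_conf_matrix), 3)
#      round(class_metrics(rf_test_conf_matrix), 3)
```

From the tables above, the SVM recovers 7 of the 9 held-out OSN targets (sensitivity about 0.78) and the Random Forest 8 of 9 (about 0.89), though both pay for this with a sizeable number of false positives.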
To further assess model performance, we plot the ROC curves and calculate the AUC for both the SVM and Random Forest models.
# Evaluate the models on the test set (AUC and ROC)
svm_test_roc <- roc(test_data$label, svm_test_prob)
## Setting levels: control = OSN, case = Other
## Setting direction: controls > cases
plot(svm_test_roc, main = "SVM ROC Curve")
print(paste("Final SVM Test AUC:", auc(svm_test_roc)))
## [1] "Final SVM Test AUC: 0.766914191419142"
rf_test_roc <- roc(test_data$label, rf_test_pred)
## Setting levels: control = OSN, case = Other
## Setting direction: controls > cases
plot(rf_test_roc, main = "Random Forest ROC Curve")
print(paste("Final Random Forest Test AUC:", auc(rf_test_roc)))
## [1] "Final Random Forest Test AUC: 0.8373899889989"
Model Improvement: Further hyperparameter tuning using automated methods like grid search or random search could refine the model’s accuracy. Additionally, exploring ensemble methods that combine predictions from several models might yield better results.
Data Expansion: Incorporating additional omics layers, such as metabolomics or additional transcription factor binding profiles, could help to improve the model’s predictive power and generalisability.
Integration with Clinical Data: Linking omics profiles with clinical outcomes could also be explored to improve the translational impact of the research.
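A minimal sketch of the ensemble idea raised under Model Improvement: average the class probabilities of the SVM and Random Forest (a soft vote). The equal weights here are an assumption and could themselves be tuned on held-out data:

```r
# Soft-voting ensemble: weighted average of two models' predicted
# probabilities for the "OSN" class (w = weight on the first model).
soft_vote <- function(p1, p2, w = 0.5) {
  stopifnot(length(p1) == length(p2), w >= 0, w <= 1)
  w * p1 + (1 - w) * p2
}

# e.g. ens_prob <- soft_vote(svm_test_prob, rf_test_pred)
#      pROC::auc(pROC::roc(test_data$label, ens_prob))
```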